
Handle pandas timestamp with nanosecs precision #49370

Merged
merged 9 commits into master on Jan 4, 2025

Conversation

srinathk10
Contributor

Why are these changes needed?

Handle pandas timestamp with nanosecs precision

Related issue number

Closes #49297

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@srinathk10 srinathk10 requested a review from a team as a code owner December 19, 2024 21:54
Contributor

@omatthew98 omatthew98 left a comment

A few comments about the tests. I also wonder if we could fold all these tests into a single parameterized test that is agnostic of the type (e.g., use the expected type in the parameters somewhere to know what type to expect). That would help with standardization across the tests.

# Convert Python datetime to pandas Timestamp with nanosecond precision
value = pd.Timestamp(value)
value = pa.array([value], type=pa.timestamp("ns"))[0]

Contributor

Are there any ways that this conversion can fail or precision can be lost? I imagine probably not but might be good to note anything here if there are.


Hey again, piping in from the original issue.

If this is already a datetime object, hasn't it already lost nanosecond precision? Is it too late to perform this coercion?

Contributor Author

Default behavior is type coercion, as we discovered in the bug from the test case. This explicitly handles the type conversion for nanoseconds.
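To illustrate the point raised above (variable names here are illustrative, not from the PR): a plain Python `datetime` tops out at microsecond precision, so by the time a value reaches this code path any sub-microsecond detail is already gone. Converting to `pd.Timestamp` cannot restore it, but it does prevent further implicit coercion from losing anything else:

```python
from datetime import datetime

import pandas as pd

# A plain Python datetime only stores down to microseconds.
dt = datetime(2025, 1, 2, 15, 30, 45, 123456)

# pd.Timestamp carries nanoseconds, so converting early keeps the
# nanosecond slot available even though the datetime itself had none.
ts = pd.Timestamp(dt)
print(ts.nanosecond)                             # 0: nothing finer survived
print((ts + pd.Timedelta(1, "ns")).nanosecond)   # 1: ns precision is usable
```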

"df, expected_df",
[
(
create_timestamp_dataframe(),
Contributor

Unless I am missing something, this looks identical to the result of create_timestamp_dataframe. Should we just call that function once? Should we only pass in one dataframe?

Contributor

Ah, I see now that there is a 1ns difference; can we include a comment to specify that? Or we could pass in just one DF and do the addition of 1ns in the test itself.
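One way the suggestion above could look (the helper and its data here are stand-ins for the PR's actual `create_timestamp_dataframe`): derive the expected frame inside the test by adding the 1ns, rather than hard-coding a near-identical twin in the parameters.

```python
import pandas as pd

def create_timestamp_dataframe() -> pd.DataFrame:
    # Stand-in for the PR's helper: three nanosecond-precision timestamps.
    return pd.DataFrame(
        {
            "timestamp": pd.to_datetime(
                [
                    "2024-01-01 00:00:00.000000001",
                    "2024-01-02 00:00:00.000000002",
                    "2024-01-03 00:00:00.000000003",
                ]
            )
        }
    )

df = create_timestamp_dataframe()
# Expected result is simply the input shifted by 1ns, computed in place.
expected_df = df.assign(timestamp=df["timestamp"] + pd.Timedelta(1, "ns"))
```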

Contributor Author

I will check if I can simplify this without limiting the correctness check.

processed_df = result.to_pandas()
assert processed_df.shape == df.shape, "DataFrame shapes do not match"
pd.testing.assert_frame_equal(processed_df, expected_df)
assert (processed_df["timestamp"] - df["timestamp"]).iloc[0] == pd.Timedelta(
Contributor

Are these last three asserts redundant with the above equality check?

Contributor Author

Ah this is checking for all 3 rows.

result = ray_data.map(process_timestamp_data)
processed_df = result.to_pandas()
# Ensure numpy.datetime64 is correctly converted to pandas Timestamp
assert isinstance(processed_df["timestamp"].iloc[0], pd.Timestamp)
Contributor

Should we just check if the series for the column has the correct type?

assert isinstance(processed_df["timestamp"].iloc[0], pd.Timestamp)

# Check that the timestamp has been incremented by 1ns
assert (processed_df["timestamp"] - df["timestamp"]).min() == pd.Timedelta(1, "ns")
Contributor

Can we standardize how we are checking the 1ns diff? It seems like we are doing the equality check against the expected frame and also individually checking the differences. I'm not super opinionated on which is clearest, but let's just use one (unless I am missing some subtlety here).

Contributor Author

This one is datetime vs np.datetime64. Let me consolidate this code better.

@srinathk10 srinathk10 added the go add ONLY when ready to merge, run all tests label Dec 23, 2024
Contributor

@omatthew98 omatthew98 left a comment

Thanks for cleaning up the tests! We could probably consolidate the two slightly more but approving regardless to unblock.

)
],
)
def test_map_numpy_datetime(df, expected_df, ray_start_regular_shared):
Contributor

Nit: It seems like test_map_numpy_datetime and test_map_timestamp_nanosecs have the same body? Can probably move this to be one test with two different sets of parameters.
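One possible shape for that consolidation, sketched with stand-in data (the real tests would go through `ray_data.map(process_timestamp_data)` rather than the plain pandas transform shown here):

```python
import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize(
    "make_value",
    [
        # numpy datetime64 input (test_map_numpy_datetime's flavor)
        lambda: np.datetime64("2024-01-01T00:00:00.000000001", "ns"),
        # pandas Timestamp input (test_map_timestamp_nanosecs's flavor)
        lambda: pd.Timestamp("2024-01-01 00:00:00.000000001"),
    ],
)
def test_map_preserves_nanoseconds(make_value):
    df = pd.DataFrame({"timestamp": [make_value()]})
    # Stand-in for the Ray Data pipeline: add 1ns to each timestamp.
    processed = df.assign(timestamp=df["timestamp"] + pd.Timedelta(1, "ns"))
    assert (processed["timestamp"] - df["timestamp"]).iloc[0] == pd.Timedelta(
        1, "ns"
    )
```

Both input flavors land in a `datetime64[ns]` column, so one body covers them.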

if isinstance(value, (pd.Timestamp, np.datetime64)):
# If it's a pandas Timestamp or numpy datetime64, convert to pyarrow
# Timestamp
value = pa.array([value], type=pa.timestamp("ns"))[0]
Contributor

if the target block type is pandas, will converting it to pyarrow Timestamp be compatible?

Contributor

We should probably put this in the arrow_block.py subclass

Contributor

also, can you comment that the purpose of this conversion is to avoid losing precision?

Contributor Author

if the target block type is pandas, will converting it to pyarrow Timestamp be compatible?

Here in the add API, we are preventing implicit type coercion when handling the pandas timestamp (nanosecond) type while converting to a PyArrow Table.

Since PyArrow holds the precision in nanoseconds, converting PyArrow -> pandas does retain nanosecond precision. The pytests convert the returned result set (ds.take_all()) to pandas and verify the timestamp nanoseconds as well.

)
],
)
def test_map_timestamp_nanosecs(df, expected_df, ray_start_regular_shared):
Contributor

  • can you document the purpose of each test?
  • maybe put them in test_pandas.py, as the issue is specific to Pandas.

@srinathk10 srinathk10 force-pushed the srinathk10-timestamp-nanosecs branch from 229aeb1 to d63bf5f on January 3, 2025 02:51
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
dtype=f"datetime64[{precision}]",
)
# Manually handle nanoseconds if the precision is 'ns'
def convert_to_datetime64(dt: datetime) -> np.datetime64:
Contributor

nit, would be cleaner to define it as a standalone function

pandas_builder.add(row)
pandas_block = pandas_builder.build()

# assert pd.api.types.is_datetime64_ns_dtype(pandas_block["col2"])
Contributor

is it still needed?

Contributor Author

Oops this needs to be uncommented.

@@ -45,28 +45,76 @@ def validate_numpy_batch(batch: Union[Dict[str, np.ndarray], Dict[str, list]]) -


def _detect_highest_datetime_precision(datetime_list: List[datetime]) -> str:
"""Detect the highest precision for a list of datetime objects.
Contributor

didn't know there is such a function. This looks nicer than the previous approach. 👍
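For readers following along, a hedged sketch of what a precision detector like this might do; the PR's actual implementation may differ, and only `pd.Timestamp` (not plain `datetime`) exposes a `nanosecond` attribute:

```python
from datetime import datetime
from typing import List

def detect_highest_datetime_precision(datetime_list: List[datetime]) -> str:
    # Hypothetical sketch: walk the values and keep the finest precision seen.
    precision = "s"
    for dt in datetime_list:
        if getattr(dt, "nanosecond", 0):  # only pd.Timestamp carries this
            return "ns"
        if dt.microsecond:
            if dt.microsecond % 1000:
                precision = "us"
            elif precision == "s":
                precision = "ms"  # whole milliseconds only
    return precision
```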

Contributor

@omatthew98 omatthew98 left a comment

Two comments to clean up the _convert_datetime_list_to_array function.

# Manually calculate nanoseconds by adding microseconds * 1000
nanoseconds = dt.microsecond * 1000 + dt.nanosecond
# Now manually create a datetime64 with nanosecond precision
return np.datetime64(
Contributor

I wonder if for this case we could simply add a timedelta to the lower-precision np.datetime64 created from the original timestamp. This would prevent us from having to format the string directly, which is probably fine either way. I am envisioning something like this (from ChatGPT):

import datetime
import numpy as np

# Suppose we have year=2025, month=1, day=2, hour=15, minute=30, second=45,
# plus an extra 789 nanoseconds beyond microseconds.
# 1 microsecond = 1000 nanoseconds
# So total ns = (123456 * 1000) + 789
# But standard datetime can only store the microseconds part (123456):
base_dt = datetime.datetime(2025, 1, 2, 15, 30, 45, 123456)
leftover_ns = 789

# Convert the base datetime to a ns-based np.datetime64
base_np_dt = np.datetime64(base_dt, 'ns')

# Add the leftover nanoseconds as a timedelta64:
final_np_dt = base_np_dt + np.timedelta64(leftover_ns, 'ns')

print(final_np_dt)
# 2025-01-02T15:30:45.123456789

Contributor Author

Nice, will give this a shot. Yeah, converting to string was not ideal, but I had tried a lot of options and none worked so far.

Contributor

Yeah, from what I read, unless the above works (I didn't do much manual testing), the string formatting is unfortunately what we have to go with. Seems fine to leave as-is since I don't think we will be editing this code much, but it could be nice to go with the above.

Contributor Author

This worked :).

f"{dt.second:02d}.{nanoseconds:09d}",
"ns",
)
elif precision == "us":
Contributor

Nit: Can collapse a lot / all of these into the same branch with something like return np.datetime64(dt).astype(f"datetime64[{precision}]")
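A sketch of that collapsed branch (function name is illustrative): `np.datetime64(dt)` on a plain `datetime` yields microsecond precision, and `astype` then widens or truncates to the requested unit, so one line covers the `s`/`ms`/`us` branches. The `ns` case still needs the separate path discussed above, since a plain `datetime` cannot carry nanoseconds.

```python
from datetime import datetime

import numpy as np

def to_datetime64(dt: datetime, precision: str) -> np.datetime64:
    # np.datetime64(dt) is datetime64[us]; astype converts to the target
    # unit, truncating any finer components.
    return np.datetime64(dt).astype(f"datetime64[{precision}]")

print(to_datetime64(datetime(2025, 1, 2, 15, 30, 45, 123456), "ms"))
# 2025-01-02T15:30:45.123
```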

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
@raulchen raulchen merged commit 3df8163 into master Jan 4, 2025
5 checks passed
@raulchen raulchen deleted the srinathk10-timestamp-nanosecs branch January 4, 2025 18:21
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 7, 2025
## Why are these changes needed?
Handle pandas timestamp with nanosecs precision

## Related issue number

"Closes ray-project#49297"

---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 9, 2025
## Why are these changes needed?
Handle pandas timestamp with nanosecs precision

## Related issue number

"Closes ray-project#49297"

---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Roshan Kathawate <[email protected]>
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] Inconsistent behavior with Ray Data and timestamps
4 participants